Automatic Genre Classification for Resource Scarce Languages
Abstract
In this article we present research on the development of automatic genre classification systems for resource-scarce languages. The main approaches to text classification from the literature are presented and compared experimentally to identify the approach best suited to genre classification. A fixed feature set is extracted for seven classes from the available training data for each of the six languages under scrutiny and paired with each classification algorithm to test the algorithms' performance. Support vector machines, combined with term frequency and inverse document frequency features, give the best results.
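As an illustration of the term frequency and inverse document frequency features named in the abstract, the following is a minimal sketch in plain Python. It assumes the common weighting tf × log(N / df) with whitespace tokenization; the abstract does not state which variant or tokenizer the authors used, and the function name `tfidf_vectors` and the toy documents are hypothetical. The resulting vectors are the kind of input that would then be passed to a classifier such as a support vector machine.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df).

    Assumption: lowercased whitespace tokens; the source does not specify
    the tokenization or the exact TF-IDF variant.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Relative term frequency multiplied by inverse document frequency.
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy documents, not data from the study.
docs = [
    "the court ruled on the appeal",
    "the recipe needs flour and sugar",
    "the court heard the appeal today",
]
vectors = tfidf_vectors(docs)
```

Under this weighting, a term that occurs in every document (here "the") receives weight zero, while a term confined to one document (here "flour") is weighted more heavily than one shared by two documents, which is what makes the features useful for separating classes.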
Similar resources
Cross-Lingual Genre Classification for Closely Related Languages
Resource scarcity is a topic that is continually researched by the HLT community, especially for the South African context. We explore the possibility of leveraging existing resources to help facilitate the development of new resources for under-resourced languages by using cross-lingual classification methods. We investigate the application of an Afrikaans genre classification system on Dutch t...
Sentiment Classification in Resource-Scarce Languages by using Label Propagation
With the advent of consumer generated media (e.g., Amazon reviews, Twitter, etc.), sentiment classification becomes a heated topic. Previous work heavily relies on a large amount of linguistic resources, which are difficult to obtain in resource-scarce languages. To overcome this problem, we investigate the usefulness of label propagation, which is a graph-based semi-supervised learning method....
Automatic Music Genre Identification
Nowadays, automatic analysis of music signals has gained a considerable importance due to the growing amount of music data found on the Web. Music genre classification is one of the interesting research areas in music information retrieval systems. In this paper several techniques were implemented and evaluated for music genre classification including feature extraction, feature selection and m...
CorpusCollie - A Web Corpus Mining Tool for Resource-Scarce Languages
This paper describes CORPUSCOLLIE, an open-source software package that is geared towards the collection of clean web corpora of resource-scarce languages. CORPUSCOLLIE uses a wide range of information sources to find, classify and clean documents for a given target language. One of the most powerful components in CORPUSCOLLIE is a maximum-entropy based language identification module that is ab...
Classifying Web corpora into domain and genre using automatic feature identification
Texts in representative corpora are typically classified into their domain and genre. However, it is not clear if existing domain and genre typologies can be applied at all to unlabeled data collected from the Web, for instance, to results of crawling. This study attempts to establish the most suitable categories for describing domains and genres of arbitrary web texts and to estimate the accur...